The $k$-means problem consists of finding $k$ centers in $\mathbb{R}^d$ that minimize the
sum of the squared distances of all points in an input set $P$ from $\mathbb{R}^d$ to their
closest respective center. Awasthi et al. recently showed that there exists a
constant $c > 1$ such that it is NP-hard to approximate the $k$-means objective
within a factor of $c$. We establish that the constant $c$ is at least 1.0013.


For a given set of points $P \subset \mathbb{R}^d$, the $k$-means problem consists of finding a partition
of $P$ into $k$ clusters $(C_1, \ldots, C_k)$ with corresponding centers $(c_1, \ldots, c_k)$ that minimize
the sum of the squared distances of all points in $P$ to their corresponding center, i.e., the
quantity

\[
  \arg\min_{(C_1,\ldots,C_k),\,(c_1,\ldots,c_k)} \; \sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2,
\]

where $\|\cdot\|$ denotes the Euclidean distance. The $k$-means problem has been well-known
since the fifties, when Lloyd [Llo57] developed the famous local search heuristic also
known as the $k$-means algorithm. Various exact, approximate, and heuristic algorithms
have been developed since then. For a constant number of clusters $k$ and a constant
dimension $d$, the problem can be solved by enumerating weighted Voronoi diagrams [IKI94].
If the dimension is arbitrary but the number of centers is constant, many polynomial-time
approximation schemes are known. For example, [FL11] gives an algorithm with
running time $O(nd + 2^{\mathrm{poly}(1/\varepsilon, k)})$. In the general case, only constant-factor
approximation algorithms are known [JV01, KMN+04], but no algorithm with an approximation
ratio smaller than 9 has yet been found.

Surprisingly, no hardness results for the $k$-means problem were known even as recently
as ten years ago. Today, it is known that the $k$-means problem is NP-hard, even for
constant $k$ and arbitrary dimension $d$ [ADHP09, Das08] and also for arbitrary $k$ and
constant $d$ [MNV09]. Early this year, Awasthi et al. [ACKS15] showed that there exists
a constant $\varepsilon' > 0$ such that it is NP-hard to approximate the $k$-means objective within
a factor of $1 + \varepsilon'$. They use a reduction from the Vertex Cover problem on triangle-free
graphs. Here, one is given a graph $G = (V, E)$ that does not contain a triangle,
and the goal is to compute a minimum set of vertices $S$ that covers all the edges,
meaning that for any $(v_i, v_j) \in E$, it holds that $v_i \in S$ or $v_j \in S$. To decide if $k$
vertices suffice to cover a given $G$, they construct a $k$-means instance in the following
way. Let $b_i = (0, \ldots, 1, \ldots, 0)$ be the $i$th vector in the standard basis of $\mathbb{R}^{|V|}$. For an
edge $e = (v_i, v_j) \in E$, set $x_e = b_i + b_j$. The instance consists of the parameter $k$ and the
point set $\{x_e \mid e \in E\}$. Note that the number of points is $|E|$ and their dimension is $|V|$.
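As a minimal sketch of this construction (not code from [ACKS15]; the helper names `kmeans_points` and `sq_dist` and the 4-cycle example are illustrative assumptions), the following Python snippet builds one point per edge and evaluates the cost of the solution that places centers at the basis vectors of a vertex cover.

```python
def kmeans_points(num_vertices, edges):
    """Return one 0/1 vector x_e = b_i + b_j for every edge e = (i, j)."""
    points = []
    for i, j in edges:
        x = [0] * num_vertices
        x[i] = 1
        x[j] = 1
        points.append(x)
    return points


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


# Hypothetical toy input: a 4-cycle, which is triangle-free.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
points = kmeans_points(4, edges)

# The vertex cover {0, 2} induces the centers b_0 and b_2.
centers = [[1, 0, 0, 0], [0, 0, 1, 0]]
cost = sum(min(sq_dist(p, c) for c in centers) for p in points)
print(len(points), cost)
```

Every point is at squared distance 1 from the basis vector of a covering endpoint, so this particular solution costs one unit per edge. Moving each center to the centroid of its cluster can only lower the cost.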

A relatively simple analysis shows that this reduction is approximation-preserving. A
vertex cover $S \subseteq V$ of size $k$ corresponds to a solution for $k$-means where we have centers
at $\{b_i : v_i \in S\}$ and each point $x_e$ with $e = (v_i, v_j)$ is assigned to a center
corresponding to a vertex in $S \cap \{v_i, v_j\}$ (which is
nonempty because $S$ is a vertex cover). In addition, it can also be shown that a good
solution for $k$-means reveals a small vertex cover of $G$ when $G$ is triangle-free.

Unfortunately, this reduction transforms $(1+\varepsilon)$-hardness for Vertex Cover on triangle-free
graphs to $(1+\varepsilon')$-hardness for $k$-means where $\varepsilon' = O(\varepsilon/\Delta)$ and $\Delta$ is the maximum
degree of $G$. Awasthi et al. [ACKS15] proved hardness of Vertex Cover on triangle-free
graphs via a reduction from general Vertex Cover, where the best hardness result of
Dinur and Safra [DS05] has an unspecified large constant $\Delta$. Furthermore, the reduction
uses a sophisticated spectral analysis to bound the size of the minimum vertex cover of
a suitably chosen graph product.

Our result is based on the observation that hardness results for Vertex Cover on
small-degree graphs lead to hardness of Vertex Cover on triangle-free graphs with the
same degree in an extremely simple way. Combined with the result of Chlebík and
Chlebíková [CC06] that proves hardness of approximating Vertex Cover on 4-regular
graphs within $\approx 1.02$, this observation gives hardness of Vertex Cover on triangle-free,
degree-4 graphs without relying on the spectral analysis. The same reduction from
Vertex Cover on triangle-free graphs to $k$-means then proves APX-hardness of $k$-means,
with an improved ratio due to the small degree of $G$.

1. Main Result 

Our main result is the following theorem. 

Theorem 1. It is NP-hard to approximate k-means within a factor 1.0013. 

We prove hardness of $k$-means by a reduction from Vertex Cover on 4-regular graphs,
for which we have the following hardness result of Chlebík and Chlebíková [CC06].

Theorem 2 ([CC06], see also Appendix A). Given a 4-regular graph $G = (V(G), E(G))$, it is
NP-hard to distinguish the following cases.

• $G$ has a vertex cover with at most $\alpha_{\min} |V(G)|$ vertices.

• Every vertex cover of $G$ has at least $\alpha_{\max} |V(G)|$ vertices.

Here, $\alpha_{\min} = (2\mu_{4,k} + 8)/(4\mu_{4,k} + 12)$ and $\alpha_{\max} = (2\mu_{4,k} + 9)/(4\mu_{4,k} + 12)$ with $\mu_{4,k} \le 21.7$.
In particular, it is NP-hard to approximate Vertex Cover on degree-4 graphs within a
factor of $\alpha_{\max}/\alpha_{\min} \ge 1.0192$.
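The numerical bound can be checked with exact rational arithmetic. The sketch below is illustrative only; the function names are assumptions. It evaluates the two constants at the boundary value $\mu_{4,k} = 21.7$, which is the worst case because $\alpha_{\max}/\alpha_{\min} = 1 + 1/(2\mu_{4,k} + 8)$ decreases as $\mu_{4,k}$ grows.

```python
from fractions import Fraction


def alpha_min(mu):
    """Completeness threshold of Theorem 2 as a function of mu_{4,k}."""
    return (2 * mu + 8) / (4 * mu + 12)


def alpha_max(mu):
    """Soundness threshold of Theorem 2 as a function of mu_{4,k}."""
    return (2 * mu + 9) / (4 * mu + 12)


mu = Fraction(217, 10)  # upper bound on mu_{4,k} stated in Theorem 2
vc_ratio = alpha_max(mu) / alpha_min(mu)
print(vc_ratio, float(vc_ratio))
```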

Given a 4-regular graph $G = (V(G), E(G))$ for Vertex Cover with $n := |V(G)|$ vertices
and $2n$ edges, we first partition $E(G)$ into $E_1$ and $E_2$ such that $|E_1| = |E_2| = |E(G)|/2 = n$
and such that the subgraph $(V(G), E_2)$ is bipartite. Such a partition always exists:
every graph has a cut containing at least half of the edges (well-known; see, e.g., [MU05]).
Choose $n$ of these cut edges for $E_2$, and let $E_1$ be the remaining edges. We define
$G' = (V(G'), E(G'))$ by splitting each edge in $E_1$ into three edges. Formally, $G'$ is given by

\[
  V(G') = V(G) \cup \Bigl( \bigcup_{e=(u,v) \in E_1} \{ v'_{e,u}, v'_{e,v} \} \Bigr),
\]

\[
  E(G') = \Bigl( \bigcup_{e=(u,v) \in E_1} \{ (u, v'_{e,u}), (v'_{e,u}, v'_{e,v}), (v'_{e,v}, v) \} \Bigr) \cup E_2.
\]

Notice that $G'$ has $n + 2n = 3n$ vertices and $3n + n = 4n$ edges. It is also easy to see
that the maximum degree of $G'$ is 4, and that $G'$ does not have any triangle, since any
triangle of $G$ contains at least one edge of $E_1$ (because $(V(G), E_2)$ is bipartite) and each
edge of $E_1$ is split into three.
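The construction of $G'$ can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' code: the cut is found by a simple local search (flip a vertex while that enlarges the cut), which is a swapped-in way to obtain a cut with at least half of the edges, and the names `half_cut`, `split_graph`, and `has_triangle` are hypothetical. The example input is $K_5$, which is 4-regular with $n = 5$.

```python
from itertools import combinations


def half_cut(n, edges):
    """Local search for a cut; at a local optimum it holds >= |E|/2 edges."""
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for v in range(n):
            incident = [(a, b) for a, b in edges if v in (a, b)]
            same = sum(1 for a, b in incident if side[a] == side[b])
            if same > len(incident) - same:  # flipping v enlarges the cut
                side[v] ^= 1
                improved = True
    return [(a, b) for a, b in edges if side[a] != side[b]]


def split_graph(n, edges):
    """Return (number of vertices, edge list) of G' for a 4-regular G."""
    e2 = half_cut(n, edges)[: len(edges) // 2]  # n cut edges, kept as they are
    e1 = [e for e in edges if e not in e2]      # remaining edges, to be split
    new_edges = list(e2)
    next_id = n
    for u, v in e1:
        a, b = next_id, next_id + 1             # play the role of v'_{e,u}, v'_{e,v}
        next_id += 2
        new_edges += [(u, a), (a, b), (b, v)]
    return next_id, new_edges


def has_triangle(num_vertices, edges):
    """Brute-force triangle test, adequate for small graphs."""
    adj = {frozenset(e) for e in edges}
    return any(
        {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= adj
        for a, b, c in combinations(range(num_vertices), 3)
    )


n = 5
k5 = list(combinations(range(n), 2))
num_v, g_prime = split_graph(n, k5)
print(num_v, len(g_prime), has_triangle(num_v, g_prime))
```

The local search terminates because every flip strictly increases the number of cut edges, and at termination every vertex has at least half of its incident edges in the cut.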

Given $G'$ as an instance of Vertex Cover on triangle-free graphs, the reduction to the
$k$-means problem is the same as before. Let $b_i = (0, \ldots, 1, \ldots, 0)$ be the $i$th vector in the
standard basis of $\mathbb{R}^{3n}$. For an edge $e = (v_i, v_j) \in E(G')$, set $x_e = b_i + b_j$. The instance
consists of the parameter $k = (\alpha_{\min} + 1)n$ and the point set $\{x_e \mid e \in E(G')\}$. Notice that
the number of points is now $4n$ and their dimension is $3n$.

We now analyze the reduction. Note that for $k$-means, once a cluster is fixed as
a set of points, the optimal center and the cost of the cluster are determined$^1$. Let
$\mathrm{cost}(C)$ be the cost of a cluster $C$. We abuse notation and use $C$ for the set of edges
$\{e : x_e \in C\} \subseteq E(G')$ as well. For an integer $l$, define an $l$-star to be a set of $l$ distinct
edges incident to a common vertex. The following lemma is proven by Awasthi et al.
and shows that if $C$ is cost-efficient, then two vertices are sufficient to cover many edges
in $C$. Furthermore, an optimal $C$ is either a star or a triangle.


Lemma 3 ([ACKS15], Proposition 9 and Lemma 11). Let $C = \{x_{e_1}, \ldots, x_{e_l}\}$ be a
cluster. Then $l - 1 \le \mathrm{cost}(C) \le 2l - 1$, and there exist two vertices that cover at least
$\lceil 2l - 1 - \mathrm{cost}(C) \rceil$ edges in $C$. Furthermore, $\mathrm{cost}(C) = l - 1$ if and only if $C$ is either
an $l$-star or a triangle, and otherwise, $\mathrm{cost}(C) \ge l - 1/2$.
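The cost values in Lemma 3 are easy to reproduce on small clusters. The following sketch (illustrative only; `cluster_cost` and the three example edge sets are assumptions) computes the exact cost of a cluster with its centroid as center, using rational arithmetic.

```python
from fractions import Fraction


def cluster_cost(edges):
    """Exact 1-means cost of the points x_e = b_i + b_j for the given edges."""
    l = len(edges)
    vertices = sorted({v for e in edges for v in e})
    # Coordinate v of the centroid is (number of cluster edges touching v) / l.
    centroid = {v: Fraction(sum(v in e for e in edges), l) for v in vertices}
    cost = Fraction(0)
    for e in edges:
        for v in vertices:
            coord = 1 if v in e else 0
            cost += (coord - centroid[v]) ** 2
    return cost


star = [(0, 1), (0, 2), (0, 3), (0, 4)]   # a 4-star
triangle = [(0, 1), (1, 2), (0, 2)]
path = [(0, 1), (1, 2), (2, 3)]           # neither a star nor a triangle

for name, cluster in [("star", star), ("triangle", triangle), ("path", path)]:
    print(name, len(cluster), cluster_cost(cluster))
```

Coordinates of vertices outside the cluster are zero for every point and for the centroid, so they can be skipped without changing the cost.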


1.1. Completeness 

Lemma 4. If $G$ has a vertex cover of size at most $\alpha_{\min} n$, the instance of $k$-means
produced by the reduction admits a solution of cost at most $(3 - \alpha_{\min})n$.

$^1$For $k = 1$, the optimal solution to the $k$-means problem is the centroid of the point set. This is due
to a well-known fact, see, e.g., Lemma 2.1 in [KMN+04].
Proof. Suppose $G$ has a vertex cover $S$ with at most $\alpha_{\min} n$ vertices. For each edge
$e = (u, v) \in E_1$, let $v'(e) = v'_{e,v}$ (the new vertex adjacent to $v$) if $u \in S$, and
$v'(e) = v'_{e,u}$ (the new vertex adjacent to $u$) otherwise. Let $S' := S \cup \bigcup_{e \in E_1} \{v'(e)\}$.
Since $S$ is a vertex cover of $G$, for every edge $e \in E_1$, $S$ and $v'(e)$
cover all three edges of $E(G')$ corresponding to $e$. Therefore, $S'$ is a vertex cover of $G'$,
and since $|E_1| = n$, it has at most $(\alpha_{\min} + 1)n$ vertices.

For the $k$-means solution, let each cluster correspond to a vertex in $S'$, and assign each
edge $e \in E(G')$ to the cluster corresponding to a vertex incident to $e$ (choose an arbitrary
one if there are two). Each edge is assigned to a cluster since $S'$ is a vertex cover, and
each cluster is a star by construction. Since there are $4n$ points and $k = \alpha_{\min} n + n$, the
total cost of the solution is, by Lemma 3,

\[
  \sum_{i=1}^{k} \mathrm{cost}(C_i) = \sum_{i=1}^{k} (|C_i| - 1) = \Bigl( \sum_{i=1}^{k} |C_i| \Bigr) - k = 4n - k = (3 - \alpha_{\min})n. \qquad \Box
\]

1.2. Soundness 

Lemma 5. If every vertex cover of $G$ has size at least $\alpha_{\max} n$, then any solution of the
$k$-means instance produced by the reduction costs at least $(3 - \alpha_{\min} + \frac{1}{3}(\alpha_{\max} - \alpha_{\min}))n$.

Proof. Suppose every vertex cover of $G$ has at least $\alpha_{\max} n$ vertices. We claim that every
vertex cover of $G'$ also has to be large.

Claim 6. Every vertex cover of $G'$ has at least $(\alpha_{\max} + 1)n$ vertices.

Proof. Let $S'$ be a vertex cover of $G'$. If $S'$ contains both $v'_{e,u}$ and $v'_{e,v}$ for any $e = (u, v) \in E_1$,
then $S' \cup \{u\} \setminus \{v'_{e,u}\}$ is a vertex cover with the same or smaller size. Therefore, we
can without loss of generality assume that for each $e = (u, v) \in E_1$, $S'$ contains exactly
one vertex in $\{v'_{e,u}, v'_{e,v}\}$. Set $S := S' \cap V(G)$, thus $S$ has cardinality $|S'| - n$. Each
$e \in E_2$ is covered by $S$ by definition. If an $e \in E_1$ is not covered by $S$, at least one of the
three edges of $G'$ corresponding to $e$ is not covered by $S'$. Thus, every edge $e \in E(G)$ is
covered by $S$, so $S$ is a vertex cover of $G$. Since $|S| \ge \alpha_{\max} n$, $|S'| \ge (\alpha_{\max} + 1)n$. □
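Claim 6 can be checked by exhaustive search on a tiny instance. The sketch below is illustrative: the partition of $K_5$ into `e1` and `e2` is a hand-picked assumption with `e2` bipartite between $\{0, 1\}$ and $\{2, 3, 4\}$, and `min_vertex_cover_size` is a hypothetical brute-force helper. It compares the minimum vertex cover of $G$ with that of $G'$. The claim gives the lower bound, and the construction of $S'$ in the proof of Lemma 4 gives the matching upper bound, so the two sizes should differ by exactly $n$.

```python
from itertools import combinations


def min_vertex_cover_size(num_vertices, edges):
    """Size of a smallest vertex set touching every edge (exhaustive search)."""
    for size in range(num_vertices + 1):
        for subset in combinations(range(num_vertices), size):
            chosen = set(subset)
            if all(a in chosen or b in chosen for a, b in edges):
                return size


n = 5
e2 = [(0, 2), (0, 3), (0, 4), (1, 2), (1, 3)]  # kept edges, bipartite
e1 = [(0, 1), (1, 4), (2, 3), (2, 4), (3, 4)]  # edges that get split in three

g_prime = list(e2)
next_id = n
for u, v in e1:
    a, b = next_id, next_id + 1                # new vertices on the split edge
    next_id += 2
    g_prime += [(u, a), (a, b), (b, v)]

vc_g = min_vertex_cover_size(n, e1 + e2)
vc_g_prime = min_vertex_cover_size(next_id, g_prime)
print(vc_g, vc_g_prime)
```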

Fix $k$ clusters $C_1, \ldots, C_k$. Without loss of generality, let $C_1, \ldots, C_s$ be clusters that
correspond to a star, and $C_{s+1}, \ldots, C_k$ be clusters that do not correspond to an $l$-star
for any $l$. For $i = 1, \ldots, s$, let $v(i)$ be the vertex covering all edges in $C_i$, and for
$i = s+1, \ldots, k$, let $v(i), v'(i)$ be two vertices covering at least $\lceil 2|C_i| - 1 - \mathrm{cost}(C_i) \rceil$
edges in $C_i$ by Lemma 3. Let $E^* \subseteq E(G')$ be the set of edges not covered by any $v(i)$ or
$v'(i)$. The cardinality of $E^*$ is at most

\[
  \sum_{i=s+1}^{k} \bigl( |C_i| - (2|C_i| - 1 - \mathrm{cost}(C_i)) \bigr) = \sum_{i=s+1}^{k} \bigl( \mathrm{cost}(C_i) - (|C_i| - 1) \bigr).
\]

Adding one vertex for each edge of $E^*$ to the set $\{v(i)\}_{1 \le i \le s} \cup \{v(i), v'(i)\}_{s+1 \le i \le k}$
yields a vertex cover of $G'$ of size at most

\[
  s + 2(k - s) + \sum_{i=s+1}^{k} \bigl( \mathrm{cost}(C_i) - (|C_i| - 1) \bigr).
\]
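Every vertex cover of $G'$ has size at least $(\alpha_{\max} + 1)n = k + (\alpha_{\max} - \alpha_{\min})n$, so we
have

\[
  (k - s) + \sum_{i=s+1}^{k} \bigl( \mathrm{cost}(C_i) - (|C_i| - 1) \bigr) \ge (\alpha_{\max} - \alpha_{\min})n.
\]

Now, either $k - s \ge \frac{2}{3}(\alpha_{\max} - \alpha_{\min})n$ or $\sum_{i=s+1}^{k} (\mathrm{cost}(C_i) - (|C_i| - 1)) \ge \frac{1}{3}(\alpha_{\max} - \alpha_{\min})n$.
In the former case, since $\mathrm{cost}(C_i) \ge |C_i| - \frac{1}{2}$ for $i > s$ by Lemma 3 (recall that $G'$ is
triangle-free), the total cost is

\[
  \sum_{i=1}^{k} \mathrm{cost}(C_i) \ge \sum_{i=1}^{s} (|C_i| - 1) + \sum_{i=s+1}^{k} \Bigl( |C_i| - \frac{1}{2} \Bigr) \ge \Bigl( \sum_{i} |C_i| \Bigr) - k + \frac{1}{3}(\alpha_{\max} - \alpha_{\min})n.
\]

In the latter case, the total cost can be split to obtain that

\[
  \sum_{i=1}^{k} \mathrm{cost}(C_i) \ge \sum_{i=1}^{k} (|C_i| - 1) + \sum_{i=s+1}^{k} \bigl( \mathrm{cost}(C_i) - (|C_i| - 1) \bigr) \ge \Bigl( \sum_{i} |C_i| \Bigr) - k + \frac{1}{3}(\alpha_{\max} - \alpha_{\min})n.
\]

Therefore, in any case, since $\sum_i |C_i| = 4n$ and $k = (\alpha_{\min} + 1)n$, the total cost is at least

\[
  \Bigl( \sum_{i} |C_i| \Bigr) - k + \frac{1}{3}(\alpha_{\max} - \alpha_{\min})n = \Bigl( 3 - \alpha_{\min} + \frac{1}{3}(\alpha_{\max} - \alpha_{\min}) \Bigr) n.
\]

□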


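The above completeness and soundness analyses show that it is NP-hard to distinguish
the following cases.

• There exists a solution of cost at most $(3 - \alpha_{\min})n$.

• Every solution has cost at least $(3 - \alpha_{\min} + \frac{1}{3}(\alpha_{\max} - \alpha_{\min}))n$.

Therefore, it is NP-hard to approximate $k$-means within a factor of

\[
  \frac{\bigl( 3 - \alpha_{\min} + \frac{1}{3}(\alpha_{\max} - \alpha_{\min}) \bigr) n}{(3 - \alpha_{\min})n}
  = 1 + \frac{\alpha_{\max} - \alpha_{\min}}{3(3 - \alpha_{\min})}
  = 1 + \frac{1}{3(10\mu_{4,k} + 28)}
  \ge 1.0013.
\]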
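The last two steps of this computation can be verified with exact arithmetic. The snippet below is a sketch with assumed variable names; it plugs in the boundary value $\mu_{4,k} = 21.7$, for which the closed form $1 + 1/(3(10\mu_{4,k} + 28))$ is smallest, and compares the ratio of the two cost thresholds with that closed form.

```python
from fractions import Fraction

mu = Fraction(217, 10)                      # upper bound on mu_{4,k}
a_min = (2 * mu + 8) / (4 * mu + 12)
a_max = (2 * mu + 9) / (4 * mu + 12)

yes_cost = 3 - a_min                        # completeness cost per n
no_cost = 3 - a_min + (a_max - a_min) / 3   # soundness cost per n
ratio = no_cost / yes_cost
closed_form = 1 + 1 / (3 * (10 * mu + 28))

print(ratio, float(ratio))
```

With these values, $10\mu_{4,k} + 28 = 245$, so the ratio equals $1 + 1/735$.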
A. Remark on Theorem 2


To obtain Theorem 2, note that the proof of Theorem 17 in [CC06] states that it is
NP-hard to distinguish whether the vertex cover has at most

\[
  |V(G)| \cdot \frac{2(|V(H)| - M(H))/k + 8 + 2\varepsilon}{2|V(H)|/k + 12}
\]

or at least

\[
  |V(G)| \cdot \frac{2(|V(H)| - M(H))/k + 9 - 2\varepsilon}{2|V(H)|/k + 12}
\]

vertices. By the assumption in the first sentence of the proof and because $|V(H)| = 2M(H)$,
$(|V(H)| - M(H))/k$ and $|V(H)|/k$ can be replaced by $\mu_{4,k}$ and $2\mu_{4,k}$, respectively, where
$\mu_{4,k}$ is as defined in Definition 6 in [CC06]. By Theorem 16 in [CC06], $\mu_{4,k} \le 21.7$.